.. _`K-means Clustering`:
.. _`org.sysess.sympathy.machinelearning.k_means`:

K-means Clustering
==================

.. image:: dataset_blobs.svg
   :width: 48

Clusters data by trying to separate samples into n groups of equal variance.

**Documentation**

Clusters data by trying to separate samples into n groups of equal variance,
minimizing the sum of squared distances of the samples to their closest
cluster center (the inertia).

*Configuration*:

- *n_clusters*

  The number of clusters to form as well as the number of centroids to
  generate.

- *n_init*

  Number of times the k-means algorithm will be run with different centroid
  seeds. The final result will be the best output of n_init consecutive runs
  in terms of inertia.

- *init*

  Method for initialization:

  - 'k-means++': selects initial cluster centers for k-means clustering in a
    smart way to speed up convergence. See section Notes in k_init for more
    details.
  - 'random': chooses `n_clusters` observations (rows) at random from the
    data for the initial centroids.

  If an ndarray is passed, it should be of shape (n_clusters, n_features) and
  gives the initial centers.

  If a callable is passed, it should take arguments X, n_clusters and a
  random state and return an initialization.

- *algorithm*

  K-means algorithm to use. The classical EM-style algorithm is "full". The
  "elkan" variation is more efficient on data with well-defined clusters, by
  using the triangle inequality. However, it is more memory intensive due to
  the allocation of an extra array of shape (n_samples, n_clusters).

  For now "auto" (kept for backward compatibility) chooses "elkan", but this
  might change in the future for a better heuristic.

  .. versionchanged:: 0.18
     Added Elkan algorithm

- *max_iter*

  Maximum number of iterations of the k-means algorithm for a single run.

- *tol*

  Relative tolerance with regard to the Frobenius norm of the difference in
  the cluster centers of two consecutive iterations to declare convergence.

- *precompute_distances*

  Precompute distances (faster but takes more memory).

  - 'auto': do not precompute distances if ``n_samples * n_clusters`` exceeds
    12 million. This corresponds to about 100 MB overhead per job using
    double precision.
  - True: always precompute distances.
  - False: never precompute distances.

  .. deprecated:: 0.23
     'precompute_distances' was deprecated in version 0.23 and will be
     removed in 0.25. It has no effect.

- *n_jobs*

  The number of OpenMP threads to use for the computation. Parallelism is
  sample-wise on the main Cython loop which assigns each sample to its
  closest center.

  ``None`` or ``-1`` means using all processors.

  .. deprecated:: 0.23
     ``n_jobs`` was deprecated in version 0.23 and will be removed in 0.25.

- *random_state*

  Determines random number generation for centroid initialization. Use an
  int to make the randomness deterministic. See random_state.

*Attributes*:

- *cluster_centers_*

  Coordinates of cluster centers. If the algorithm stops before fully
  converging (see ``tol`` and ``max_iter``), these will not be consistent
  with ``labels_``.

- *labels_*

  Labels of each point.

- *inertia_*

  Sum of squared distances of samples to their closest cluster center.

*Input ports*:

*Output ports*:

**model** : model
    Model

**Definition**

*Input ports*

*Output ports*

:model: model

   Model

.. automodule:: node_clustering

.. class:: KMeansClustering
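
The way the configuration options and attributes described above fit together
can be illustrated with a small pure-Python sketch of the classical
("full"-style) iteration. This is not the implementation used by the node: the
function names ``k_means``, ``single_run``, ``assign`` and
``squared_distance`` are hypothetical, only the 'random' initialization is
shown, and ``tol`` is treated as an absolute threshold on the squared shift of
the centers, which simplifies the relative tolerance described above.

.. code-block:: python

   import random


   def squared_distance(a, b):
       """Squared Euclidean distance between two equal-length sequences."""
       return sum((x - y) ** 2 for x, y in zip(a, b))


   def assign(data, centers):
       """Label each sample with the index of its closest center."""
       return [
           min(range(len(centers)), key=lambda k: squared_distance(row, centers[k]))
           for row in data
       ]


   def single_run(data, n_clusters, max_iter, tol, rng):
       """One k-means run starting from a 'random' initialization."""
       # 'random' init: use n_clusters distinct observations (rows) as centers.
       centers = [list(row) for row in rng.sample(data, n_clusters)]
       for _ in range(max_iter):
           labels = assign(data, centers)
           new_centers = []
           for k in range(n_clusters):
               members = [row for row, lab in zip(data, labels) if lab == k]
               if members:
                   # Move the center to the mean of its assigned samples.
                   new_centers.append([sum(col) / len(members) for col in zip(*members)])
               else:
                   # Simplification: an empty cluster keeps its old center.
                   new_centers.append(centers[k])
           # Squared Frobenius norm of the change in the cluster centers.
           shift = sum(squared_distance(c, n) for c, n in zip(centers, new_centers))
           centers = new_centers
           if shift <= tol:  # assumption: absolute, not relative, tolerance
               break
       # Recompute labels so they are consistent with the returned centers.
       labels = assign(data, centers)
       inertia = sum(squared_distance(row, centers[lab]) for row, lab in zip(data, labels))
       return centers, labels, inertia


   def k_means(data, n_clusters, n_init=10, max_iter=300, tol=1e-4, random_state=0):
       """Run single_run n_init times and keep the result with the lowest inertia."""
       rng = random.Random(random_state)  # an int seed makes the runs repeatable
       best = None
       for _ in range(n_init):
           result = single_run(data, n_clusters, max_iter, tol, rng)
           if best is None or result[2] < best[2]:
               best = result
       return best


   # Two well-separated blobs of four points each.
   data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
   cluster_centers, labels, inertia = k_means(data, n_clusters=2, random_state=0)

For this data the centers settle at (0.5, 0.5) and (10.5, 10.5) with an
inertia of 4.0. Which blob receives label 0 depends on the seed, which is one
reason to fix ``random_state`` when results need to be reproducible.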